USO DEL ANÁLISIS DE AGRUPAMIENTO DE PAM PARA AGRUPAR PAÍSES POR RIQUEZA, DESARROLLO, EMISIONES DE CARBONO Y FELICIDAD

TABLA DE CONTENIDOS SUGERIDO

  1. IMPORTACIÓN DE BIBLIOTECAS

  2. CARGA DE DATOS

  3. LIMPIEZA DE DATOS

  4. PREPROCESAMIENTO DE DATOS

  5. SIEMPRE (CASI) ESCALAR LOS DATOS.

  6. ANÁLISIS DE COMPONENTES PRINCIPALES (PCA)

  7. USO DEL ANÁLISIS DE AGRUPAMIENTO DE PAM PARA AGRUPAR PAÍSES POR RIQUEZA, DESARROLLO, EMISIONES DE CARBONO Y FELICIDAD

  8. UN MAPA MUNDIAL DE LOS CLÚSTERES

FIN

1. - IMPORTACION DE BIBLIOTECAS

Importar librerias

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(stringr)
library(cluster)
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(reshape2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.2
library(NbClust)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.2.2
library(psych) 
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.2.2

2. - CARGA DE DATOS

Cargando el conjunto de datos

# importar los datos usando la libreria "xlsx"
library(xlsx)
## Warning: package 'xlsx' was built under R version 4.2.2
hpi <- read.xlsx('data.xlsx',sheetIndex = 5, header = TRUE)

Mostrar contenido del dataset

# MOSTRAR LAS PRIMERAS FILAS
head(hpi)
##   NA.            NA..1       NA..2                        NA..3
## 1  NA Complete dataset        <NA>                         <NA>
## 2  NA             <NA>        <NA>                         <NA>
## 3  NA             <NA>        <NA>                         <NA>
## 4  NA             <NA>        <NA>                         <NA>
## 5  NA         HPI Rank     Country                       Region
## 6  NA              110 Afghanistan Middle East and North Africa
##                       NA..4                     NA..5            NA..6
## 1                      <NA>                      <NA>             <NA>
## 2                      <NA>                      <NA>             <NA>
## 3                      <NA>                      <NA>             <NA>
## 4                      <NA>                      <NA>             <NA>
## 5 Average Life \nExpectancy Average Wellbeing\n(0-10) Happy Life Years
## 6                    59.668                       3.8 12.3960238087402
##                     NA..7                  NA..8
## 1                    <NA>                   <NA>
## 2                    <NA>                   <NA>
## 3                    <NA>                   <NA>
## 4                    <NA>                   <NA>
## 5 Footprint\n(gha/capita) Inequality of Outcomes
## 6                    0.79      0.426557441321005
##                                 NA..9                        NA..10
## 1                                <NA>                          <NA>
## 2                                <NA>                          <NA>
## 3                                <NA>                          <NA>
## 4                                <NA>                          <NA>
## 5 Inequality-adjusted Life Expectancy Inequality-adjusted Wellbeing
## 6                    38.3488176249152              3.39049431767374
##               NA..11             NA..12     NA..13           NA..14 NA..15
## 1               <NA>               <NA>       <NA>             <NA>     NA
## 2               <NA>               <NA>       <NA>             <NA>     NA
## 3               <NA>               <NA>       <NA>             <NA>     NA
## 4               <NA>               <NA>       <NA>             <NA>     NA
## 5 Happy Planet Index GDP/capita\n($PPP) Population       GINI index     NA
## 6   20.2253497709571   690.842629014956   29726803 Data unavailable     NA
##   NA..16 NA..17 NA..18 NA..19 NA..20 NA..21 NA..22 NA..23 NA..24 NA..25 NA..26
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 5     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 6     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##   NA..27 NA..28 NA..29 NA..30 NA..31 NA..32 NA..33 NA..34 NA..35 NA..36 NA..37
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 5     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 6     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##   NA..38 NA..39 NA..40 NA..41
## 1     NA     NA     NA     NA
## 2     NA     NA     NA     NA
## 3     NA     NA     NA     NA
## 4     NA     NA     NA     NA
## 5     NA     NA     NA     NA
## 6     NA     NA     NA     NA
# MOSTRAR LAS ULTIMAS FILAS
tail(hpi)
##     NA. NA..1
## 158  NA  <NA>
## 159  NA  <NA>
## 160  NA  <NA>
## 161  NA  <NA>
## 162  NA  <NA>
## 163  NA  <NA>
##                                                                                                            NA..2
## 158                                                                                                         <NA>
## 159                                                                                          GDP/capita \n($PPP)
## 160                                                                                                   Population
## 161                                                                                                         GINI
## 162 For further information about the data sources and calculation methodology, please see the Methodology paper
## 163                                                                                                         <NA>
##                                                                                   NA..3
## 158                                                                                <NA>
## 159 2012 data (or nearest available year)  World Development Indicators, The World Bank
## 160 2012 data (or nearest available year)  World Development Indicators, The World Bank
## 161 2012 data (or nearest available year)  World Development Indicators, The World Bank
## 162                                                                                <NA>
## 163                                                                                <NA>
##     NA..4 NA..5 NA..6 NA..7 NA..8 NA..9 NA..10 NA..11 NA..12 NA..13 NA..14
## 158  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
## 159  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
## 160  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
## 161  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
## 162  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
## 163  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
##     NA..15 NA..16 NA..17 NA..18 NA..19 NA..20 NA..21 NA..22 NA..23 NA..24
## 158     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 159     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 160     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 161     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 162     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 163     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##     NA..25 NA..26 NA..27 NA..28 NA..29 NA..30 NA..31 NA..32 NA..33 NA..34
## 158     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 159     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 160     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 161     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 162     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 163     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##     NA..35 NA..36 NA..37 NA..38 NA..39 NA..40 NA..41
## 158     NA     NA     NA     NA     NA     NA     NA
## 159     NA     NA     NA     NA     NA     NA     NA
## 160     NA     NA     NA     NA     NA     NA     NA
## 161     NA     NA     NA     NA     NA     NA     NA
## 162     NA     NA     NA     NA     NA     NA     NA
## 163     NA     NA     NA     NA     NA     NA     NA

- Se observa que hay presencia de datos faltantes tanto en filas y columnas, asi como también e información poco relevante

3. - LIMPIEZA DE DATOS

identificacion y tratamiento de datos nulos

# Mostrar la cantidad de datos nulos por cada columna
colSums(is.na(hpi))
##    NA.  NA..1  NA..2  NA..3  NA..4  NA..5  NA..6  NA..7  NA..8  NA..9 NA..10 
##    163     21     13     15     22     22     22     22     22     22     22 
## NA..11 NA..12 NA..13 NA..14 NA..15 NA..16 NA..17 NA..18 NA..19 NA..20 NA..21 
##     22     22     22     22    163    163    163    163    163    163    163 
## NA..22 NA..23 NA..24 NA..25 NA..26 NA..27 NA..28 NA..29 NA..30 NA..31 NA..32 
##    163    163    163    163    163    163    163    163    163    163    163 
## NA..33 NA..34 NA..35 NA..36 NA..37 NA..38 NA..39 NA..40 NA..41 
##    163    163    163    163    163    163    163    163    163
# Mostrar % de datos nulos por cada columna
porcentajeMiss <- function(x) {sum(is.na(x)) / length(x)*100}
apply(hpi, 2, porcentajeMiss)
##        NA.      NA..1      NA..2      NA..3      NA..4      NA..5      NA..6 
## 100.000000  12.883436   7.975460   9.202454  13.496933  13.496933  13.496933 
##      NA..7      NA..8      NA..9     NA..10     NA..11     NA..12     NA..13 
##  13.496933  13.496933  13.496933  13.496933  13.496933  13.496933  13.496933 
##     NA..14     NA..15     NA..16     NA..17     NA..18     NA..19     NA..20 
##  13.496933 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 
##     NA..21     NA..22     NA..23     NA..24     NA..25     NA..26     NA..27 
## 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 
##     NA..28     NA..29     NA..30     NA..31     NA..32     NA..33     NA..34 
## 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 
##     NA..35     NA..36     NA..37     NA..38     NA..39     NA..40     NA..41 
## 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000

-Se observa que desde la columna “NA..15” en adelante presentan un 100% de valores nulos, los cuales tendremos que eliminar ya que no poseen ningún dato, al igual que la columna “NA.”

Eliminando contenido poco relevante

# Eliminando columnas con datos nulos
hpi_temp <- hpi[,-c(1,16:42)]
head(hpi_temp)
##              NA..1       NA..2                        NA..3
## 1 Complete dataset        <NA>                         <NA>
## 2             <NA>        <NA>                         <NA>
## 3             <NA>        <NA>                         <NA>
## 4             <NA>        <NA>                         <NA>
## 5         HPI Rank     Country                       Region
## 6              110 Afghanistan Middle East and North Africa
##                       NA..4                     NA..5            NA..6
## 1                      <NA>                      <NA>             <NA>
## 2                      <NA>                      <NA>             <NA>
## 3                      <NA>                      <NA>             <NA>
## 4                      <NA>                      <NA>             <NA>
## 5 Average Life \nExpectancy Average Wellbeing\n(0-10) Happy Life Years
## 6                    59.668                       3.8 12.3960238087402
##                     NA..7                  NA..8
## 1                    <NA>                   <NA>
## 2                    <NA>                   <NA>
## 3                    <NA>                   <NA>
## 4                    <NA>                   <NA>
## 5 Footprint\n(gha/capita) Inequality of Outcomes
## 6                    0.79      0.426557441321005
##                                 NA..9                        NA..10
## 1                                <NA>                          <NA>
## 2                                <NA>                          <NA>
## 3                                <NA>                          <NA>
## 4                                <NA>                          <NA>
## 5 Inequality-adjusted Life Expectancy Inequality-adjusted Wellbeing
## 6                    38.3488176249152              3.39049431767374
##               NA..11             NA..12     NA..13           NA..14
## 1               <NA>               <NA>       <NA>             <NA>
## 2               <NA>               <NA>       <NA>             <NA>
## 3               <NA>               <NA>       <NA>             <NA>
## 4               <NA>               <NA>       <NA>             <NA>
## 5 Happy Planet Index GDP/capita\n($PPP) Population       GINI index
## 6   20.2253497709571   690.842629014956   29726803 Data unavailable

- Observando el archivo excel se detectó que hay datos nulos e innecesarios al inicio y al final del dataset

# Eliminando filas con datos nulos y contenido no relevante
hpi_clean <- hpi_temp[-c(1:5,146:163),]
head(hpi_clean)
##    NA..1       NA..2                        NA..3  NA..4 NA..5            NA..6
## 6    110 Afghanistan Middle East and North Africa 59.668   3.8 12.3960238087402
## 7     13     Albania               Post-communist 77.347   5.5 34.4147360108721
## 8     30     Algeria Middle East and North Africa 74.313   5.6 30.4694613112307
## 9     19   Argentina                     Americas 75.927   6.5 40.1666738745799
## 10    73     Armenia               Post-communist 74.446   4.3 24.0187600607023
## 11   105   Australia                 Asia Pacific 82.052   7.2 53.0694977095264
##    NA..7              NA..8            NA..9           NA..10           NA..11
## 6   0.79  0.426557441321005 38.3488176249152 3.39049431767374 20.2253497709571
## 7   2.21  0.165133717343957 69.6711592798365 5.09765040048744   36.76687443911
## 8   2.12  0.244861747889588 60.4745445936025  5.1964491855983 33.3005427886154
## 9   3.14  0.164238301077711 68.3495831090017 6.03470703959787 35.1902436422946
## 10  2.23  0.216648103296311 66.9216823230103 3.74713959801918 25.6664172884369
## 11  9.31 0.0806782476710599 78.5600237730927  6.9133349323794 21.2289663082594
##              NA..12   NA..13           NA..14
## 6  690.842629014956 29726803 Data unavailable
## 7  4247.48543681502  2900489            28.96
## 8  5583.61615950131 37439427 Data unavailable
## 9  14357.4115893903 42095224            42.49
## 10  3565.5175749254  2978339            30.48
## 11 67646.1038529626 22728254 Data unavailable

Renombrando las columnas

names(hpi_clean) <- c("HPI_Rank",   "Country", "Region",    "Average_Life_Expectancy", "Average_Wellbeing", "Happy_Life_Years", "Footprint", "Inequality_of_Outcomes",  "Inequality_adjusted_Life_Expectancy",  "Inequality_adjusted_Wellbeing",    "Happy_Planet_Index",    "GDP_capita",  "Population",   "GINI_index")
rownames(hpi_clean) <- NULL # resetear los indices
head(hpi_clean)
##   HPI_Rank     Country                       Region Average_Life_Expectancy
## 1      110 Afghanistan Middle East and North Africa                  59.668
## 2       13     Albania               Post-communist                  77.347
## 3       30     Algeria Middle East and North Africa                  74.313
## 4       19   Argentina                     Americas                  75.927
## 5       73     Armenia               Post-communist                  74.446
## 6      105   Australia                 Asia Pacific                  82.052
##   Average_Wellbeing Happy_Life_Years Footprint Inequality_of_Outcomes
## 1               3.8 12.3960238087402      0.79      0.426557441321005
## 2               5.5 34.4147360108721      2.21      0.165133717343957
## 3               5.6 30.4694613112307      2.12      0.244861747889588
## 4               6.5 40.1666738745799      3.14      0.164238301077711
## 5               4.3 24.0187600607023      2.23      0.216648103296311
## 6               7.2 53.0694977095264      9.31     0.0806782476710599
##   Inequality_adjusted_Life_Expectancy Inequality_adjusted_Wellbeing
## 1                    38.3488176249152              3.39049431767374
## 2                    69.6711592798365              5.09765040048744
## 3                    60.4745445936025               5.1964491855983
## 4                    68.3495831090017              6.03470703959787
## 5                    66.9216823230103              3.74713959801918
## 6                    78.5600237730927               6.9133349323794
##   Happy_Planet_Index       GDP_capita Population       GINI_index
## 1   20.2253497709571 690.842629014956   29726803 Data unavailable
## 2     36.76687443911 4247.48543681502    2900489            28.96
## 3   33.3005427886154 5583.61615950131   37439427 Data unavailable
## 4   35.1902436422946 14357.4115893903   42095224            42.49
## 5   25.6664172884369  3565.5175749254    2978339            30.48
## 6   21.2289663082594 67646.1038529626   22728254 Data unavailable

- Verificando que el dataset se encuentre libre de datos nulos

sum(is.na(hpi_clean)) 
## [1] 0

4. - PREPROCESAMIENTO DE DATOS

corregir el tipo de dato de cada columna

# Se observa que el tipo de dato en la tabla no son los correctos respecto a su contenido
hpi_clean$HPI_Rank <- as.integer(hpi_clean$HPI_Rank)
hpi_clean$Region <- as.factor(hpi_clean$Region)
hpi_clean$Average_Life_Expectancy <- as.numeric(hpi_clean$Average_Life_Expectancy)
hpi_clean$Average_Wellbeing <- as.numeric(hpi_clean$Average_Wellbeing)
hpi_clean$Happy_Life_Years <- as.numeric(hpi_clean$Happy_Life_Years)
hpi_clean$Footprint <- as.numeric(hpi_clean$Footprint)
hpi_clean$Inequality_of_Outcomes <- as.numeric(hpi_clean$Inequality_of_Outcomes)
hpi_clean$Inequality_adjusted_Life_Expectancy <- as.numeric(hpi_clean$Inequality_adjusted_Life_Expectancy)
hpi_clean$Inequality_adjusted_Wellbeing <- as.numeric(hpi_clean$Inequality_adjusted_Wellbeing)
hpi_clean$Happy_Planet_Index <- as.numeric(hpi_clean$Happy_Planet_Index)
hpi_clean$GDP_capita <- as.numeric(hpi_clean$GDP_capita)
hpi_clean$Population <- as.numeric(hpi_clean$Population)
hpi_clean$GINI_index <- as.factor(hpi_clean$GINI_index)

descripción de los datos

str(hpi_clean)
## 'data.frame':    140 obs. of  14 variables:
##  $ HPI_Rank                           : int  110 13 30 19 73 105 43 8 102 87 ...
##  $ Country                            : chr  "Afghanistan" "Albania" "Algeria" "Argentina" ...
##  $ Region                             : Factor w/ 6 levels "Americas","Asia Pacific",..: 4 5 4 1 5 2 3 2 5 3 ...
##  $ Average_Life_Expectancy            : num  59.7 77.3 74.3 75.9 74.4 ...
##  $ Average_Wellbeing                  : num  3.8 5.5 5.6 6.5 4.3 7.2 7.4 4.7 5.7 6.9 ...
##  $ Happy_Life_Years                   : num  12.4 34.4 30.5 40.2 24 ...
##  $ Footprint                          : num  0.79 2.21 2.12 3.14 2.23 9.31 6.06 0.72 5.09 7.44 ...
##  $ Inequality_of_Outcomes             : num  0.427 0.165 0.245 0.164 0.217 ...
##  $ Inequality_adjusted_Life_Expectancy: num  38.3 69.7 60.5 68.3 66.9 ...
##  $ Inequality_adjusted_Wellbeing      : num  3.39 5.1 5.2 6.03 3.75 ...
##  $ Happy_Planet_Index                 : num  20.2 36.8 33.3 35.2 25.7 ...
##  $ GDP_capita                         : num  691 4247 5584 14357 3566 ...
##  $ Population                         : num  29726803 2900489 37439427 42095224 2978339 ...
##  $ GINI_index                         : Factor w/ 65 levels "24.74","25.59",..: 65 15 65 50 18 65 18 65 4 13 ...

Mostrando el resumen estadistico

summary(hpi_clean)
##     HPI_Rank        Country                                   Region  
##  Min.   :  1.00   Length:140         Americas                    :25  
##  1st Qu.: 35.75   Class :character   Asia Pacific                :21  
##  Median : 70.50   Mode  :character   Europe                      :20  
##  Mean   : 70.50                      Middle East and North Africa:14  
##  3rd Qu.:105.25                      Post-communist              :26  
##  Max.   :140.00                      Sub Saharan Africa          :34  
##                                                                       
##  Average_Life_Expectancy Average_Wellbeing Happy_Life_Years   Footprint     
##  Min.   :48.91           Min.   :2.867     Min.   : 8.97    Min.   : 0.610  
##  1st Qu.:65.04           1st Qu.:4.575     1st Qu.:18.69    1st Qu.: 1.425  
##  Median :73.50           Median :5.250     Median :29.40    Median : 2.680  
##  Mean   :70.93           Mean   :5.408     Mean   :30.25    Mean   : 3.258  
##  3rd Qu.:77.02           3rd Qu.:6.225     3rd Qu.:39.71    3rd Qu.: 4.482  
##  Max.   :83.57           Max.   :7.800     Max.   :59.32    Max.   :15.820  
##                                                                             
##  Inequality_of_Outcomes Inequality_adjusted_Life_Expectancy
##  Min.   :0.04322        Min.   :27.32                      
##  1st Qu.:0.13353        1st Qu.:48.21                      
##  Median :0.21174        Median :63.41                      
##  Mean   :0.23291        Mean   :60.34                      
##  3rd Qu.:0.32932        3rd Qu.:72.57                      
##  Max.   :0.50734        Max.   :81.26                      
##                                                            
##  Inequality_adjusted_Wellbeing Happy_Planet_Index   GDP_capita      
##  Min.   :2.421                 Min.   :12.78      Min.   :   244.2  
##  1st Qu.:4.047                 1st Qu.:21.21      1st Qu.:  1628.1  
##  Median :4.816                 Median :26.29      Median :  5691.1  
##  Mean   :4.973                 Mean   :26.41      Mean   : 13911.1  
##  3rd Qu.:5.704                 3rd Qu.:31.54      3rd Qu.: 15159.1  
##  Max.   :7.625                 Max.   :44.71      Max.   :105447.1  
##                                                                     
##    Population                   GINI_index
##  Min.   :2.475e+05   Data unavailable:75  
##  1st Qu.:4.248e+06   30.48           : 2  
##  Median :1.065e+07   24.74           : 1  
##  Mean   :4.801e+07   25.59           : 1  
##  3rd Qu.:3.343e+07   25.9            : 1  
##  Max.   :1.351e+09   26.01           : 1  
##                      (Other)         :59

- Dimensiones del dataset despues de la limpieza

dim(hpi_clean)
## [1] 140  14

Análisis y visualización de los datos

- Graficando el primer grupo de datos

pairs.panels(hpi_clean[,1:6], pch=21,main="Gráfico : Matriz de Dispersión, Histograma y Correlación",bg=c("blue"))

Se observa una alta correlacion positiva con respecto a las variables “Average_life_expectancy”, “Average_wellbeing” y “Happy_life_years”

- Graficando el segundo grupo de datos

pairs.panels(hpi_clean[,7:13], pch=21,main="Gráfico : Matriz de Dispersión, Histograma y Correlación",bg=c("blue"))

Se observa que al relacionar algunas caracteristicas correlaciones negativas, positivas y nula

plot_histogram(hpi_clean,ggtheme = theme_linedraw())

Se observa que algunas caracteristicas presentan distribucion de campana como “Average_wellbeing”, las caracteristicas “Happy_life_years” y “life_expectancy” presentan distribucion bimodal y las caracteristicas “footprint” y “GDP_capita” presentan una distribucion asimétrica positiva

Busqueda de valores atípicos

boxplot(hpi_clean[,3:13],main = "Boxplot del índice del planeta feliz 2016",
        boxwex = 0.5,col="purple")

Se observa gran cantidad de valores atipicos en la caracteristica “Population” y “GDP_capita”, para lo cual se realizará el respectivo tratamiento de outliers

Tratamiento de outliers

# creando copia
hpi_copy <- hpi_clean
# "Population"
qrts <- quantile(hpi_copy$Population, probs = c(0.25, 0.75), na.rm = TRUE)
caps <- quantile(hpi_copy$Population, probs = c(.05, .95), na.rm = TRUE)
iqr <- qrts[2]-qrts[1]
h <- 1.5 * iqr
hpi_copy$Population[hpi_copy$Population<qrts[1]-h] <- caps[1]
hpi_copy$Population[hpi_copy$Population>qrts[2]+h] <- caps[2]
# "GDP_capita"
qrts <- quantile(hpi_copy$GDP_capita, probs = c(0.25, 0.75), na.rm = TRUE)
caps <- quantile(hpi_copy$GDP_capita, probs = c(.05, .95), na.rm = TRUE)
iqr <- qrts[2]-qrts[1]
h <- 1.5 * iqr
hpi_copy$GDP_capita[hpi_copy$GDP_capita<qrts[1]-h] <- caps[1]
hpi_copy$GDP_capita[hpi_copy$GDP_capita>qrts[2]+h] <- caps[2]
dim(hpi_copy)
## [1] 140  14

- Se observa que no hubo perdida de datos y se redujo en gran medida los outliers

boxplot(hpi_copy[,3:13],main = "Boxplot del índice del planeta feliz 2016",
        boxwex = 0.5,col="purple")

5. - ESCALAR LOS DATOS

- Se procede a normalizar los datos para trabajar en una misma escala

hpi_sin_outliers <- hpi_copy
hpi_scale <- hpi_sin_outliers[, 4:13]
hpi_scale  <- scale(hpi_scale)
summary(hpi_scale)
##  Average_Life_Expectancy Average_Wellbeing Happy_Life_Years     Footprint      
##  Min.   :-2.5153         Min.   :-2.2128   Min.   :-1.60493   Min.   :-1.1493  
##  1st Qu.:-0.6729         1st Qu.:-0.7252   1st Qu.:-0.87191   1st Qu.:-0.7955  
##  Median : 0.2939         Median :-0.1374   Median :-0.06378   Median :-0.2507  
##  Mean   : 0.0000         Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.6968         3rd Qu.: 0.7116   3rd Qu.: 0.71388   3rd Qu.: 0.5317  
##  Max.   : 1.4449         Max.   : 2.0831   Max.   : 2.19247   Max.   : 5.4532  
##  Inequality_of_Outcomes Inequality_adjusted_Life_Expectancy
##  Min.   :-1.5692        Min.   :-2.2192                    
##  1st Qu.:-0.8222        1st Qu.:-0.8152                    
##  Median :-0.1751        Median : 0.2060                    
##  Mean   : 0.0000        Mean   : 0.0000                    
##  3rd Qu.: 0.7976        3rd Qu.: 0.8221                    
##  Max.   : 2.2702        Max.   : 1.4059                    
##  Inequality_adjusted_Wellbeing Happy_Planet_Index   GDP_capita     
##  Min.   :-2.1491               Min.   :-1.86308   Min.   :-0.7659  
##  1st Qu.:-0.7795               1st Qu.:-0.71120   1st Qu.:-0.6851  
##  Median :-0.1317               Median :-0.01653   Median :-0.4476  
##  Mean   : 0.0000               Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.6162               3rd Qu.: 0.70106   3rd Qu.: 0.1058  
##  Max.   : 2.2339               Max.   : 2.50110   Max.   : 2.2309  
##    Population      
##  Min.   :-0.66932  
##  1st Qu.:-0.58537  
##  Median :-0.45111  
##  Mean   : 0.00000  
##  3rd Qu.: 0.02693  
##  Max.   : 2.59717

Analisis de correlacion

ggcorrplot(round(cor(hpi_scale), 1), type = "lower", lab = T, show.legend = T,tl.srt=45)

- Se observa una correlacion muy baja o nula respecto a las caracteristicas de “population”, esto quiere decir que es una caracteristicas poco relevante para el proceso de clusterizacion ya que no aporta mucha información.

- La caracteristica “inequality_of_outcome” presenta una correlacion negatica fuere respecto a las otras caracteristicas, lo cual indica que cuando una de las variables aumenta, la otra variable disminuye y viceversa.

6. - ANÁLISIS DE COMPONENTES PRINCIPALES (PCA)

hpi_clean.pca <- PCA(hpi_scale, graph=FALSE)
print(hpi_clean.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 140 individuals, described by 10 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
fviz_screeplot(hpi_clean.pca, addlabels = TRUE, ylim = c(0, 75)) +
  theme_classic()

Según el gráfico, nos muestra que las 2 primeras dimensiones explica el 81.1% de representatividad de todo el conjunto de datos.

Adicionalmente las 2 primeras dimenciones retienen aproximadamente el 80% de las varianzas contenidas en el conjunto de datos.

Las 5 últimas componentes no superan por separado el 1% de varianza explicada.

NOTA: El método PCA es altamente sensible a outliers

fviz_pca_var(hpi_clean.pca, col.var="contrib",
             repel = TRUE 
             )+scale_colour_viridis_c() 

Segun el grafico de influencias se observa que la mayoria de las variables que apuntan a la derecha tienen influencias positivas grandes en la dimension 1 y la variable “inequality_of_outcome” tiene influencia negativa en la dimension 1 y 2

Segun el grafico se observa que la variable “population presenta una ortogonalidad respecto a la mayoria de variables, lo cual quiere decir que es una variable poco relevante”

7. - USO DEL ANÁLISIS DE AGRUPAMIENTO DE PAM

Estimación del número de clusters

Método del codo:

set.seed(2023)
ks <- 2:10
fviz_nbclust(hpi_scale, kmeans, method = "wss")+geom_vline(xintercept = 3, color = "red", linetype = 2)

Método de la silueta:

fviz_nbclust(hpi_scale, kmeans, method = "silhouette") 

Método de la estadística de brecha

fviz_nbclust(hpi_scale, kmeans, method = "gap_stat")

La mayoría de las técnicas aplicadas para estimar el número de clusters, sugieren que se debe agrupar en 3 clusters

PROPUESTA DE ANÁLISIS PAM 1: En base a las variables solicitadas como paises por riqueza, desarrollo, emiciones de carbono y felicidad

#hpi_scale2 = hpi_scale[, 1:3]
hpi_scale2 = hpi_scale[, c(1,2,3,8)]
set.seed(2023)
pam <- pam(hpi_scale2, diss=FALSE, k = 3, keep.data=TRUE)
hpi_clean$Country[pam$id.med]
## [1] "Liberia" "Romania" "Germany"
fviz_cluster(pam, stand = FALSE, geom = "point",
             ellipse.type = "norm", ggtheme = theme_classic())

**EVALUACION DE ALGORIRMO PAM PROPUESTA 1 *

fviz_silhouette(pam)
##   cluster size ave.sil.width
## 1       1   42          0.48
## 2       2   60          0.31
## 3       3   38          0.40

PROPUESTA DE ANÁLISIS PAM 2: En base a la sugerencia de la técnica de PCA

Haciendo uso de las variables con mayor representatividad

hpi_scale2 = hpi_scale[, 1:3]
set.seed(2023)
pam <- pam(hpi_scale2, diss=FALSE, k = 3, keep.data=TRUE)
hpi_clean$Country[pam$id.med]
## [1] "Liberia"        "Romania"        "United Kingdom"
fviz_cluster(pam, stand = FALSE, geom = "point",
             ellipse.type = "norm", ggtheme = theme_classic())

library(plotly)
hpi_clean['Cluster'] <- as.factor(pam$clustering)
plot_ly(hpi_clean, x = ~Average_Life_Expectancy, y = ~Average_Wellbeing, z = ~Happy_Life_Years) %>%
  add_markers(color = ~Cluster)

EVALUACION DE ALGORIRMO PAM PROPUESTA 2

fviz_silhouette(pam)
##   cluster size ave.sil.width
## 1       1   42          0.50
## 2       2   62          0.43
## 3       3   36          0.55

Respecto a los resultados obtenidos en las evaluaciones, se observa que para la propuesta 1 se obtuvo valores de (0.48, 0.31, 0.40) para cada cluster respectivamente con un promedio final de 0.38, sin embargo, para la propuesta 2 con PCA se obtuvo valores de (0.50, 0.43, 0.51) para cada cluster respectivamente con un promedio final de 0.48

Esto quiere decir que aplicando la tecnica de PCA el agrupamiento mejora significativamente, ya que reduce variables que aportan poca informacion o irrelevantes

EXTRAYENDO INFORMACIÓN DEL CLUSTER 1

cluster1 <- hpi_clean[hpi_clean$Cluster == 1,]
cluster1 <- cluster1[, c("Country", "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Happy_Planet_Index", "GDP_capita")]
cluster1$Country
##  [1] "Afghanistan"       "Benin"             "Botswana"         
##  [4] "Burkina Faso"      "Burundi"           "Cambodia"         
##  [7] "Cameroon"          "Chad"              "Comoros"          
## [10] "Cote d'Ivoire"     "Djibouti"          "Ethiopia"         
## [13] "Gabon"             "Ghana"             "Guinea"           
## [16] "Haiti"             "India"             "Kenya"            
## [19] "Lesotho"           "Liberia"           "Malawi"           
## [22] "Mauritania"        "Mozambique"        "Myanmar"          
## [25] "Namibia"           "Nepal"             "Niger"            
## [28] "Nigeria"           "Pakistan"          "Republic of Congo"
## [31] "Rwanda"            "Senegal"           "Sierra Leone"     
## [34] "South Africa"      "Swaziland"         "Syria"            
## [37] "Tanzania"          "Togo"              "Uganda"           
## [40] "Yemen"             "Zambia"            "Zimbabwe"
summary(cluster1[, c(2:6)])
##  Average_Life_Expectancy Average_Wellbeing   Footprint     Happy_Planet_Index
##  Min.   :48.91           Min.   :2.867     Min.   :0.610   Min.   :12.78     
##  1st Qu.:56.50           1st Qu.:3.900     1st Qu.:1.030   1st Qu.:16.62     
##  Median :60.27           Median :4.300     Median :1.225   Median :19.93     
##  Mean   :59.78           Mean   :4.269     Mean   :1.419   Mean   :20.41     
##  3rd Qu.:63.34           3rd Qu.:4.696     3rd Qu.:1.498   3rd Qu.:23.01     
##  Max.   :70.39           Max.   :5.500     Max.   :3.830   Max.   :31.50     
##    GDP_capita     
##  Min.   :  244.2  
##  1st Qu.:  668.8  
##  Median :  996.0  
##  Mean   : 1754.3  
##  3rd Qu.: 1552.5  
##  Max.   :10642.4

EXTRAYENDO INFORMACIÓN DEL CLUSTER 2

cluster2 <- hpi_clean[hpi_clean$Cluster == 2,]
cluster2 <- cluster2[, c("Country", "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Happy_Planet_Index", "GDP_capita")]
cluster2$Country
##  [1] "Albania"                "Algeria"                "Armenia"               
##  [4] "Bangladesh"             "Belarus"                "Belize"                
##  [7] "Bhutan"                 "Bolivia"                "Bosnia and Herzegovina"
## [10] "Bulgaria"               "China"                  "Colombia"              
## [13] "Croatia"                "Dominican Republic"     "Ecuador"               
## [16] "Egypt"                  "El Salvador"            "Estonia"               
## [19] "Georgia"                "Greece"                 "Guatemala"             
## [22] "Honduras"               "Hong Kong"              "Hungary"               
## [25] "Indonesia"              "Iran"                   "Iraq"                  
## [28] "Jamaica"                "Kazakhstan"             "Kyrgyzstan"            
## [31] "Latvia"                 "Lebanon"                "Lithuania"             
## [34] "Macedonia"              "Malaysia"               "Mauritius"             
## [37] "Mongolia"               "Montenegro"             "Morocco"               
## [40] "Nicaragua"              "Palestine"              "Paraguay"              
## [43] "Peru"                   "Philippines"            "Poland"                
## [46] "Portugal"               "Romania"                "Russia"                
## [49] "Serbia"                 "Slovakia"               "Sri Lanka"             
## [52] "Suriname"               "Tajikistan"             "Thailand"              
## [55] "Trinidad and Tobago"    "Tunisia"                "Turkey"                
## [58] "Turkmenistan"           "Ukraine"                "Uzbekistan"            
## [61] "Vanuatu"                "Vietnam"
summary(cluster2[, c(2:6)])
##  Average_Life_Expectancy Average_Wellbeing   Footprint     Happy_Planet_Index
##  Min.   :65.30           Min.   :4.200     Min.   :0.720   Min.   :14.27     
##  1st Qu.:70.75           1st Qu.:4.800     1st Qu.:1.890   1st Qu.:23.66     
##  Median :73.94           Median :5.400     Median :2.705   Median :27.43     
##  Mean   :73.22           Mean   :5.318     Mean   :3.159   Mean   :28.22     
##  3rd Qu.:75.02           3rd Qu.:5.800     3rd Qu.:3.910   3rd Qu.:34.14     
##  Max.   :83.57           Max.   :6.470     Max.   :8.823   Max.   :40.70     
##    GDP_capita     
##  Min.   :  858.9  
##  1st Qu.: 3404.4  
##  Median : 5680.8  
##  Mean   : 7545.4  
##  3rd Qu.: 9652.5  
##  Max.   :36707.8

EXTRAYENDO INFORMACIÓN DEL CLUSTER 3

cluster3 <- hpi_clean[hpi_clean$Cluster == 3,]
cluster3 <- cluster3[, c("Country", "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Happy_Planet_Index", "GDP_capita")]
cluster3$Country
##  [1] "Argentina"                "Australia"               
##  [3] "Austria"                  "Belgium"                 
##  [5] "Brazil"                   "Canada"                  
##  [7] "Chile"                    "Costa Rica"              
##  [9] "Cyprus"                   "Czech Republic"          
## [11] "Denmark"                  "Finland"                 
## [13] "France"                   "Germany"                 
## [15] "Iceland"                  "Ireland"                 
## [17] "Israel"                   "Italy"                   
## [19] "Japan"                    "Luxembourg"              
## [21] "Malta"                    "Mexico"                  
## [23] "Netherlands"              "New Zealand"             
## [25] "Norway"                   "Oman"                    
## [27] "Panama"                   "Slovenia"                
## [29] "South Korea"              "Spain"                   
## [31] "Sweden"                   "Switzerland"             
## [33] "United Kingdom"           "United States of America"
## [35] "Uruguay"                  "Venezuela"
summary(cluster3[, c(2:6)])
##  Average_Life_Expectancy Average_Wellbeing   Footprint      Happy_Planet_Index
##  Min.   :73.89           Min.   :5.800     Min.   : 2.790   Min.   :13.15     
##  1st Qu.:79.01           1st Qu.:6.475     1st Qu.: 4.322   1st Qu.:27.85     
##  Median :80.55           Median :6.950     Median : 5.290   Median :30.60     
##  Mean   :79.98           Mean   :6.892     Mean   : 5.572   Mean   :30.30     
##  3rd Qu.:81.68           3rd Qu.:7.325     3rd Qu.: 6.100   3rd Qu.:34.33     
##  Max.   :83.24           Max.   :7.800     Max.   :15.820   Max.   :44.71     
##    GDP_capita    
##  Min.   :  9703  
##  1st Qu.: 20792  
##  Median : 40178  
##  Mean   : 39057  
##  3rd Qu.: 49101  
##  Max.   :105447

8. - UN MAPA MUNDIAL DE LOS CLÚSTERES

map <- map_data("world")
map$region[map$region == "USA"] <- "United States of America"
map1 <- left_join(map, hpi_clean[, c("Region", "Country",   "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Inequality_of_Outcomes",      "Happy_Planet_Index",    "GDP_capita",  "Population", "Cluster")], by = c("region" = "Country"))
HPI_map <- ggplot(map1) + 
  geom_polygon(aes(x = long, y = lat, group = group, fill = Cluster, colour = Cluster)) +
  coord_equal() +
  labs(title = "Agrupación Índice Planeta Feliz", subtitle = "", x = NULL, y = NULL) +
  theme_linedraw() +
  theme(plot.title = element_text(face = "bold"), plot.subtitle = element_text(face = "italic"),
        legend.position = "bottom", legend.justification = "center", 
        legend.title = element_text(face = "bold"),
        )

HPI_map

Cluster 1 (bajo): Este grupo lo conforman la mayoria de los paises del continente africano, donde el PBI medio es de 1754.3, con una puntuacion de bienestar de 4.269, una esperanza de vida de 59.78 años, con emisiones de carbono de 1.419 y adicionalmente la felicidad tiene una puntuacion de 20.41

Cluster 2 (medio): Este grupo lo conforman algunos paises de America del sur como Perú, Guatemala, paises de Asia y paises pertenecientes al continente Europeo, donde el PBI medio es de 7545.4, con una puntuacion de bienestar de 5.318, una esperanza de vida de 73.22 años, con emisiones de carbono de 3.159 y adicionalmente la felicidad tiene una puntuacion de 28.22

Cluster 3 (alto): Este grupo lo conforman paises de America del norte y algunos de America del sur como Brazil, Chile, tambien lo conforman paises del continente de Oceania y algunos paises de Europa, donde el PBI medio es de 39057, con una puntuacion de bienestar de 6.892, una esperanza de vida de 79.98 años, con emisiones de carbono de 5.572 y adicionalmente la felicidad tiene una puntuacion de 30.30

Otros: representado por el color gris(no se tiene información)